# Create a new SQLite database with a name in my database folder
db <- dbConnect(RSQLite::SQLite(),
                "database/assignment3_jiayu.sqlite")
# Check if the database file exists (at the same path it was created at)
db_exists <- file.exists("database/assignment3_jiayu.sqlite")
# Display the output of the check
cat("Does the database exist?", db_exists, "\n")
## Does the database exist? TRUE
Now I have an empty relational database, assignment3_jiayu.sqlite, in my local database folder.
The following function gathers structured data on R1 and R2 universities in the US from the Wikipedia page:
Now let’s use the function scrape_R1R2_unis() to scrape the R1 and R2 universities in the US and their key information. The resulting Table 1 shows the combined R1 and R2 universities in the US. For each university, the table contains its name, status (public or private), the city in which it is located, the state in which it is located, and the URL of the university’s dedicated Wikipedia page.
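A minimal sketch of how such a scraper could be structured with rvest. The URL, the assumption that the first two wikitables hold the R1 and R2 lists, and the Classification column are mine, not the author's actual implementation; extracting each university's dedicated page URL would additionally require pulling href attributes with html_attr(), which is omitted here:

```r
library(rvest)
library(dplyr)

# Sketch only: read the Wikipedia list page and stack its wikitables.
scrape_R1R2_unis_sketch <- function(
    url = "https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States") {
  page <- read_html(url)
  # html_table() parses each <table class="wikitable"> into a tibble
  tables <- page |> html_elements("table.wikitable") |> html_table()
  # Assumed: first table is R1, second is R2; label and combine them
  bind_rows(
    tables[[1]] |> mutate(Classification = "R1"),
    tables[[2]] |> mutate(Classification = "R2")
  )
}
```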
The next function gathers unstructured data from each university’s dedicated Wikipedia page; it takes as its argument a dataframe storing the URLs to scrape:
It is time to use the function scrape_dedicated_wiki(df)
to scrape R1 and R2 universities’ geographic coordinates, endowment and
total number of students.
Table 2 adds to Table 1 the geographic coordinates of the (main) university campus, the endowment of the university in USD, and the total number of students (both undergraduate and postgraduate).
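Wikipedia infoboxes report endowments as text such as “$50.9 billion”, so a cleaning step is needed to obtain numeric USD. A small helper along these lines (the function name and the exact multipliers handled are my assumptions):

```r
# Convert strings like "$50.9 billion" or "$720 million" to numeric USD.
# Only "billion" and "million" multipliers are handled in this sketch.
parse_endowment <- function(x) {
  num <- as.numeric(gsub("[^0-9.]", "", x))   # strip "$", commas, words
  mult <- ifelse(grepl("billion", x, ignore.case = TRUE), 1e9,
          ifelse(grepl("million", x, ignore.case = TRUE), 1e6, 1))
  num * mult
}

parse_endowment("$50.9 billion")  # 50900000000
parse_endowment("$720 million")   # 720000000
```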
Here I clean the ivyleague.csv file and merge it with Table 2 to get three new variables:
1. An indicator for whether the university is an Ivy League institution
2. The university’s county (concatenated with the state into a single string, separated by “,”)
3. The university’s EIN (which can be missing for universities not in the Ivy League)
After merging, I count the missing values in each column.
The result Table 3 includes, in addition to
Table 2, an indicator for whether a university is an Ivy
League institution. If yes, there should be the university’s county and
EIN; otherwise, there are missing values.
From List 1, we can see that the numbers of missing values in Endowment (in USD) and Total Student Number are small (9 and 1, respectively). After checking the corresponding webpages, I found these values are indeed not provided, so I left them missing.
As for Geographic Coordinates, the number of missing values is also minimal (2). I checked these on the webpages and found that the coordinates are not directly provided, though they could be retrieved from the dedicated page of each university’s city. However, that would be a large amount of work for a single sub-question, so I chose not to deal with them this time.
When it comes to EIN and County, the number of missing values is as expected, because there are only 8 Ivy League institutions.
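The per-column missing-value counts reported in List 1 can be produced with a one-liner; a minimal illustration on a toy dataframe (the data are invented for the example):

```r
# Count missing values in each column of a dataframe
toy <- data.frame(
  University = c("A", "B", "C"),
  Endowment  = c(1e9, NA, 3e9),
  EIN        = c(NA, NA, "12-3456789")
)
colSums(is.na(toy))
# University  Endowment        EIN
#          0          1          2
```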
The following function checks the existence and correct dimensionality of a written table. It takes two arguments: the name of the database and the name of the table:
# Function to check the existence and correct dimensionality of the written table
# The two arguments are the database's name and the table's name
check_table_dimensions <- function(db_name, table_name) {
  # Connect to the database
  con <- dbConnect(RSQLite::SQLite(), dbname = db_name)
  # Check if the table exists
  if (dbExistsTable(con, table_name)) {
    # Retrieve information about the table
    tbl_info <- dbReadTable(con, table_name)
    # Close the database connection
    dbDisconnect(con)
    # Return information about the table
    return(list(
      rows = nrow(tbl_info),
      columns = ncol(tbl_info),
      column_names = colnames(tbl_info)
    ))
  } else {
    # Close the database connection
    dbDisconnect(con)
    # Return a message if the table does not exist
    return("Table does not exist.")
  }
}
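As a quick sanity check, the function can be exercised against a throwaway table in a temporary database (the table name and contents here are purely illustrative):

```r
library(DBI)
library(RSQLite)

# Write a small demo table to a temporary SQLite file, then verify it
tmp_db <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), tmp_db)
dbWriteTable(con, "demo_table", data.frame(x = 1:3, y = letters[1:3]))
dbDisconnect(con)

check_table_dimensions(tmp_db, "demo_table")
# $rows: 3, $columns: 2, $column_names: "x" "y"
check_table_dimensions(tmp_db, "no_such_table")
# "Table does not exist."
```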
## $rows
## [1] 279
##
## $columns
## [1] 11
##
## $column_names
## [1] "University" "Status" "City"
## [4] "State" "URL" "Geographic.Coordinates"
## [7] "Endowment..in.USD." "Total.Student.Number" "County"
## [10] "EIN" "In.Ivy.League."
According to List 2, my written table R1_and_R2_universities_in_the_US, based on Table 3, exists in my local relational database assignment3_jiayu.sqlite with 279 rows, 11 columns and the correct column names. Therefore, the written table is exactly the one I intended to write in.
Note: I replace all whitespace in the table name with underscores (“_”) to facilitate SQL queries later. The same goes for the following tables as well.
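This whitespace-to-underscore conversion is a single gsub() call; a minimal illustration (the helper name is my own):

```r
# Replace every run of whitespace in a table name with one underscore
sanitize_table_name <- function(x) gsub("\\s+", "_", x)

sanitize_table_name("R1 and R2 universities in the US")
# "R1_and_R2_universities_in_the_US"
```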
Here I create a webscraper function that returns, for the Ivy League universities only:
The ARWU ranking of each university for the years 2003, 2013 and 2023. If a university’s rank is given as a range, e.g. 76-100, I convert this to the midpoint of the range, in this case 88.
The final table is in tidy long format, where each row uniquely identifies a combination of university and year (e.g., Harvard-2003).
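The range-to-midpoint conversion can be isolated in a small helper. The function name is my own, and it assumes the range is hyphen-separated as in the example above:

```r
# Convert an ARWU rank such as "76-100" to its midpoint (88);
# plain ranks like "5" pass through unchanged.
rank_to_midpoint <- function(x) {
  parts <- strsplit(as.character(x), "-")[[1]]
  mean(as.numeric(parts))
}

rank_to_midpoint("76-100")  # 88
rank_to_midpoint("5")       # 5
```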
Run the webscraper scrape_annual_rank(years, unis).
As Table 4 shows, I successfully scraped the ARWU world and national ranks of the 8 Ivy League universities in 2003, 2013 and 2023 and stored them in a tidy long-format dataframe.
# Check the existence and dimensions of the table
existAndDimen_3a <- check_table_dimensions("database/assignment3_jiayu.sqlite",
  "ARWU_annual_ranks_for_Ivy_League_universities_in_2003_2013_and_2023")
existAndDimen_3a
## $rows
## [1] 24
##
## $columns
## [1] 4
##
## $column_names
## [1] "University" "Year" "World.Rank" "National.Rank"
According to List 3, my written table ARWU_annual_ranks_for_Ivy_League_universities_in_2003_2013_and_2023, based on Table 4, exists in my local relational database assignment3_jiayu.sqlite with 24 rows, 4 columns and the correct column names. Therefore, the written table is exactly the one I intended to write in.
Here I write a webscraper function that gathers from ARWU, for each Ivy League university only:
The university’s ranking in every social science in which it has been ranked. Again, if a range is given, I take the midpoint.
The final table is in tidy long format, where each row uniquely identifies a combination of university and discipline (e.g., Harvard-Economics).
Run the webscraper
scrape_subject_rank(subjects_vector, university_vector).
As you can see from Table 5, I successfully scraped the 2023 ARWU subject ranks in every social science for each Ivy League university and stored them in a tidy long-format dataframe. If a university does not have a rank for a particular subject, I leave it as a missing value so that the table is uniform.
# Check the existence and dimensions of the table
existAndDimen_3b <- check_table_dimensions("database/assignment3_jiayu.sqlite",
  "ARWU_2023_subject_ranks_for_social_sciences_of_Ivy_League_universities")
existAndDimen_3b
## $rows
## [1] 112
##
## $columns
## [1] 3
##
## $column_names
## [1] "University" "Subject" "Rank"
According to List 4, my written table ARWU_2023_subject_ranks_for_social_sciences_of_Ivy_League_universities, based on Table 5, exists in my local relational database assignment3_jiayu.sqlite with 112 rows, 3 columns and the correct column names. Therefore, the written table is exactly the one I intended to write in.
For each Ivy League university, I gather financial data from the ProPublica API. Using httr, I access the Organization method endpoint for each Ivy League university, passing its EIN, to gather the following variables for the years 2010-2020:
1. Total revenue
2. Total assets
After retrieving these data, I format them in a tidy long format, where each row is a unique combination of university and year (e.g., Harvard-2020).
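The reshaping step can be sketched with base R: given one record per filing, stack the per-year records into a long table with one row per university-year. The field names and values below are illustrative, not the API's actual schema:

```r
# Illustrative per-university filing records (field names are assumptions)
filings <- list(
  list(university = "Harvard", year = 2019, totrevenue = 5.5e9, totassets = 6.0e10),
  list(university = "Harvard", year = 2020, totrevenue = 5.4e9, totassets = 6.5e10)
)

# Bind the records into one long dataframe: one row per university-year
long <- do.call(rbind, lapply(filings, function(f) {
  data.frame(University    = f$university,
             Year          = f$year,
             Total.Revenue = f$totrevenue,
             Total.Assets  = f$totassets)
}))
long
```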
The output Table 6 has 4 columns: University, Year,
Total Revenue and Total Assets, where each row is a unique combination
of university and year.
If a university does not have data for a particular year, I leave it as a missing value so that the table is uniform.
# Check the existence and dimensions of the table
existAndDimen_4a <- check_table_dimensions("database/assignment3_jiayu.sqlite",
  "Ivy_League_universities_financial_data_from_2011_to_2021")
existAndDimen_4a
## $rows
## [1] 88
##
## $columns
## [1] 4
##
## $column_names
## [1] "University" "Year" "Total.Revenue" "Total.Assets"
According to List 5, my written table Ivy_League_universities_financial_data_from_2011_to_2021, based on Table 6, exists in my local relational database assignment3_jiayu.sqlite with 88 rows, 4 columns and the correct column names. Therefore, the written table is exactly the one I intended to write in.
Using the package tidycensus, I retrieve the names of all counties in the US and their estimated median household income for both 2015 and 2020 (based on the American Community Survey (ACS)).
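A sketch of this retrieval, assuming the standard ACS median-household-income variable B19013_001 and 5-year estimates (a Census API key must be registered beforehand with census_api_key(); the object name is my own):

```r
library(tidycensus)
library(dplyr)

# Pull median household income for every US county for 2015 and 2020.
# B19013_001 is the ACS median household income variable.
county_income <- bind_rows(lapply(c(2015, 2020), function(yr) {
  get_acs(geography = "county",
          variables = "B19013_001",
          year      = yr,
          survey    = "acs5") |>
    mutate(Year = yr)
}))
```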
As shown in Table 7, I obtain the names of the counties
in which Ivy League universities are located and their estimated median
household income in 2015 and 2020 (based on the American Community
Survey (ACS)) and put them together in a tidy long format, where each
row is a unique combination of university and year.
# Check the existence and dimensions of the table
exsitAndDimen_4b <- check_table_dimensions("database/assignment3_jiayu.sqlite",
"Ivy_League_universities_counties_and_their_estimated_median_household_income_in_2015_and_2020")
exsitAndDimen_4b
## $rows
## [1] 16
##
## $columns
## [1] 4
##
## $column_names
## [1] "University" "Year"
## [3] "County" "Estimated.Median.Household.Income"
According to List 6, my written table Ivy_League_universities_counties_and_their_estimated_median_household_income_in_2015_and_2020, based on Table 7, exists in my local relational database assignment3_jiayu.sqlite with 16 rows, 4 columns and the correct column names. Therefore, the written table is exactly the one I intended to write in.
After completing Exercises 1 - 4, I have five distinct tables in my relational database assignment3_jiayu.sqlite. My goal is now to bring the data stored in the five tables together in a variety of ways using SQL, and then analyse the data using R.
First, I use SQL to query the 5 tables to get the data I need.
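As a minimal illustration of this querying step, a two-table join by university name; the join and the selected columns are assumptions, not the exact query I ran:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "database/assignment3_jiayu.sqlite")
# Join the main university table to the 2023 subject ranks.
# Column names containing dots would need bracketing for SQLite.
q <- "
SELECT u.University, u.State, s.Subject, s.Rank
FROM R1_and_R2_universities_in_the_US AS u
JOIN ARWU_2023_subject_ranks_for_social_sciences_of_Ivy_League_universities AS s
  ON u.University = s.University
"
subject_ranks <- dbGetQuery(con, q)
dbDisconnect(con)
```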
I have included in Table 8 the following variables for
each Ivy League institution:
Now it is time to do the data visualisation with
ggplot.
## `geom_smooth()` using formula = 'y ~ x'
I plot the relationships between the variables as required in the question. Here are my comments on these relationships (for Ivy League institutions only):
In Figure 1.1, we can see a positive relationship between a university’s average ranking and its average Econ/PS/Soc ranking. This is reasonable, as the average ranking can represent a university’s overall strength, and a university with a higher average rank (a smaller rank number) is more likely to perform well in a particular subject as well.
In Figure 1.2, a negative relationship exists between a university’s average ranking and its endowment per student. This also matches the common sense that a better university (with a smaller rank number) has more outstanding alumni, who are more likely to endow their alma mater. However, it is interesting that most (6 of 8) of the Ivy League institutions’ average rankings are within 25, while their endowment per student varies a lot; it is worth considering other factors when looking at these elite universities, such as total student numbers and the subjects they excel in.
In Figure 1.3, there is a positive relationship between a university’s average endowment per student and the average median household income of the county where it sits. Since these universities enrol students nationwide, the relationship is a little surprising. One possible explanation could be that the universities whose data points are near the trend line enrol more local students, while those far from the line have a more diverse student background.
In Figure 1.4, there is a slightly positive relationship between a university’s average total revenue per student and the average median household income. Looking back at Figure 1.3, it is interesting that the number of outliers there is the same as in Figure 1.4. If the outliers in Figure 1.3 are exactly those in Figure 1.4, I would venture a preliminary conclusion that 5 Ivy League institutions are more localised, while the other three welcome more students nationwide.
I use SQL to query the table R1_and_R2_universities_in_the_US to get the data I need. Table 9, obtained from the query, consists of 4 variables for every R1 and R2 university:
To turn Table 9 into a spatial object for plotting, I first convert my character coordinates into numeric ones identifiable by st_as_sf().
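The character-to-numeric conversion can look like this, assuming the coordinates are stored as a single “lat; lon” string (the exact stored format, and the example values, are assumptions):

```r
# Split a "lat; lon" string into numeric columns usable by st_as_sf()
coords <- data.frame(
  Geographic.Coordinates = c("42.3770; -71.1167", "40.3573; -74.6672")
)
parts <- strsplit(coords$Geographic.Coordinates, ";\\s*")
coords$lat <- as.numeric(sapply(parts, `[`, 1))
coords$lon <- as.numeric(sapply(parts, `[`, 2))
coords[, c("lat", "lon")]
```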
Let’s do the data visualisation with it using tmap.